Abstract: Fine-tuning neural networks has historically been quite slow and cumbersome on CPUs. However, with mixed precision BF16 training and the Intel® Extension for PyTorch*, fine-tuning is feasible on a CPU and perhaps even preferred where cost and availability are key factors. In this tutorial, I will walk you through a real-world example of training an AI image segmentation model using PyTorch* 1.13.1 (with ResNet34 + UNet architecture); the model will learn to identify roads and speed limits from satellite images, all on the newly released 4th Gen Intel® Xeon® Scalable Processor (Sapphire Rapids).
Author: Ben Consolvo | Company: Intel |
Date: March 4, 2023
Please join me on Intel's Developer Discord to have further discussion following the talk. My user is silvos#5002
Invite link: https://discord.gg/rv2Gp55UJQ
You can also connect with me here:
Introduction
I am excited to show you today that CPUs are a viable option for fine-tuning a deep learning model.
In this tutorial, you will learn how to accelerate a PyTorch training job with Sapphire Rapids. We will use the Intel Extension for PyTorch (IPEX) library to enable the built-in AI acceleration of the Sapphire Rapids CPU with minimal code changes. You will work through a pixel segmentation example: training on satellite images with matching street labels.
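To preview how little code this takes, here is a tiny self-contained sketch of the BF16 mixed precision pattern we will apply later. The two-layer model is a stand-in (not the tutorial's ResNet34 + UNet), and the commented ipex.optimize line assumes intel_extension_for_pytorch is installed:

```python
import torch
import torch.nn as nn

# Tiny stand-in model; the tutorial itself uses ResNet34 + UNet from cresi.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 1, 3, padding=1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# With IPEX installed, one extra line prepares model and optimizer for BF16:
# import intel_extension_for_pytorch as ipex
# model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

x = torch.randn(2, 3, 64, 64)   # dummy image batch
y = torch.randn(2, 1, 64, 64)   # dummy target mask

with torch.cpu.amp.autocast(dtype=torch.bfloat16):  # mixed precision on CPU
    out = model(x)

# Compute the loss in float32 outside the autocast region
loss = nn.functional.mse_loss(out.float(), y)
loss.backward()
optimizer.step()
print(out.dtype)  # conv layers ran in bfloat16 under autocast
```

The forward pass runs the convolutions in bfloat16, while the loss and optimizer update stay in float32, which is the same shape of change we will make inside the cresi training loop.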
The potential cost savings of renting a CPU instance on one of the major CSPs, instead of a GPU instance, are significant. That said, the latest CPU processors are still being rolled out to the CSPs, and there is typically a lag between a processor's launch date and its general availability on the major CSPs. The Sapphire Rapids CPU I will show you today is a beast! It is being hosted on the Intel® Developer Cloud*, and you can sign up for the Beta here: cloud.intel.com.
Coming soon, I will provide a new tutorial around using PyTorch 2.0 after its release on March 14, 2023.
Here are the results of the tests I ran to understand how Sapphire Rapids performs on this fine-tuning workload.
Key References
Much of my material and code was taken from the CRESI repository below. I've adapted it for use on Sapphire Rapids, with optimizations from Intel Extension for PyTorch.
In particular, I was able to piece together a workflow using the notebooks here:
I also highly recommend these Medium articles for another detailed explanation of how to get started with the SpaceNet5 data:
I also referenced two Hugging Face blogs by Julien Simon here. He ran his tests on the AWS* r7iz.metal-16xl instance:
4th Gen Intel® Xeon® Scalable Processor (Sapphire Rapids) - CPU specs
Operating System:
Key Software:
Model Architecture:
!lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 224
On-line CPU(s) list: 0-223
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8480+
CPU family: 6
Model: 143
Thread(s) per core: 2
Core(s) per socket: 56
Socket(s): 2
Stepping: 8
CPU max MHz: 3800.0000
CPU min MHz: 800.0000
BogoMIPS: 4000.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mc
a cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss
ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art
arch_perfmon pebs bts rep_good nopl xtopology nonstop_
tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes6
4 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xt
pr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_
deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3d
nowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpci
d_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibr
s_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad
fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid c
qm rdt_a avx512f avx512dq rdseed adx smap avx512ifma cl
flushopt clwb intel_pt avx512cd sha_ni avx512bw avx512v
l xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc
cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni
avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_ac
t_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke
waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni a
vx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_d
etect cldemote movdiri movdir64b enqcmd fsrm md_clear s
erialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16
amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 5.3 MiB (112 instances)
L1i: 3.5 MiB (112 instances)
L2: 224 MiB (112 instances)
L3: 210 MiB (2 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-55,112-167
NUMA node1 CPU(s): 56-111,168-223
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
and seccomp
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer
sanitization
Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB fillin
g, PBRSB-eIBRS SW sequence
Srbds: Not affected
Tsx async abort: Not affected
Under "flags", you can see amx_bf16, amx_tile, and amx_int8, so you know AMX is available for use.
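If you want to check for these flags programmatically, here is a small illustrative helper (the function is my own, not part of any library; on Linux you could feed it the flags line from /proc/cpuinfo):

```python
def detect_amx(flags: str) -> dict:
    """Check a CPU flags string (from lscpu or /proc/cpuinfo) for AMX/BF16 support."""
    present = set(flags.split())
    wanted = ["amx_tile", "amx_bf16", "amx_int8", "avx512_bf16"]
    return {f: (f in present) for f in wanted}

# Flags excerpted from the lscpu output above
flags = "avx512_bf16 amx_bf16 amx_tile amx_int8 avx512_fp16 flush_l1d"
print(detect_amx(flags))
```

If all four entries come back True, the processor can accelerate BF16 training with AMX.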
The environment setup will depend on what already comes installed on the machine you are using, but here is a guide to what I did to get my environment set up properly. Please note that you may need to make adjustments depending on the build and configuration of your machine. I have left these steps to be run on the command line, as opposed to in the notebook, as the text output gets too long for a notebook.
To get access to the latest Intel computing hardware:
- You can sign up for the beta here: cloud.intel.com
- Head over to the Get Started page for instructions on setting up an instance
- I used the "4th Generation Intel® Xeon® Scalable processors" instance
Installing apt-get packages
#Fix broken installs (if any) for apt-get
sudo apt-get --fix-broken install
#Update apt-get
sudo apt-get update
sudo apt-get autoremove
#Install packages
sudo apt-get install -y \
cdo \
nco \
gdal-bin \
libgdal-dev \
libjemalloc-dev \
awscli \
cmake \
apt-utils \
python3-dev \
libssl-dev \
libffi-dev \
libncurses-dev \
libgl1 \
ffmpeg \
libsm6 \
libxext6 \
numactl
Building Anaconda environment and installing some packages with conda install.
#Download and install Miniconda distribution of Anaconda
wget https://repo.anaconda.com/miniconda/Miniconda3-py38_23.1.0-1-Linux-x86_64.sh -O ~/miniconda.sh && \
sudo /bin/bash ~/miniconda.sh -b -p /opt/conda && \
rm ~/miniconda.sh && \
sudo /opt/conda/bin/conda clean -tip && \
sudo ln -s /opt/conda/etc/profile.d/conda.sh /etc/profile.d/conda.sh && \
sudo echo ". /opt/conda/etc/profile.d/conda.sh" >> ~/.bashrc && \
source ~/.bashrc
#Create new conda environment
conda create --name py39 python=3.9
echo "conda activate py39" >> ~/.bashrc
source ~/.bashrc
#install conda packages
conda install -c conda-forge libgdal
conda install tiledb=2.2
conda install poppler
conda install intel-openmp
conda install gperftools -c conda-forge
Installing PyTorch for CPU, as well as a requirements.txt file of packages
#pip upgrades
python -m pip install --upgrade pip wheel
pip3 install setuptools==57.5.0 #for 3rd gen xeon
#install torch for CPU
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu
#install requirements packages
pip3 install -r requirements.txt
#put the newly created conda environment into Jupyter*
python -m ipykernel install --user --name=py39
#install oneCCL if you want to launch more complicated (distributed) jobs
#python -m pip install oneccl_bind_pt==1.13 -f https://developer.intel.com/ipex-whl-stable-cpu
The requirements.txt is:
scikit-image
jupyterlab
ipykernel
numpy
pandas
scipy
matplotlib
fiona
opencv-python
shapely
imagecodecs
tqdm
osmnx
torchsummary
geopandas==0.6.3
tensorboardX
tensorboard
networkx==2.8
numba
utm
intel_extension_for_pytorch
gdal==3.0.4
The cresi repository contains a lot of the code needed to run training and inference in this demo. Navigate to the folder where you want to store the cresi repo and go ahead and clone it with:
git clone https://github.com/avanetten/cresi
For this tutorial we will use public imagery from the SpaceNet 5 Challenge. These weights and images are part of the Registry of Open Data on AWS and can be downloaded for free. You will need an AWS account to access the data, and the AWS CLI tool installed. Then, simply execute aws configure from the command line and input your AWS Access Key ID and your AWS Secret Access Key.
Once the AWS CLI is set up, you should be able to download the dataset directly from the public S3 bucket below. Both train and test datasets are listed: the train datasets contain labels, while the test datasets do not. The commands below list all of the available relevant data, zipped up into .tar.gz files.
!aws s3 ls s3://spacenet-dataset/spacenet/SN5_roads/tarballs/ --human-readable
2019-09-03 20:59:32    5.8 GiB SN5_roads_test_public_AOI_7_Moscow.tar.gz
2019-09-24 08:43:02    3.2 GiB SN5_roads_test_public_AOI_8_Mumbai.tar.gz
2019-09-24 08:43:47    4.9 GiB SN5_roads_test_public_AOI_9_San_Juan.tar.gz
2019-09-14 13:13:26   35.0 GiB SN5_roads_train_AOI_7_Moscow.tar.gz
2019-09-14 13:13:34   18.5 GiB SN5_roads_train_AOI_8_Mumbai.tar.gz
!aws s3 ls s3://spacenet-dataset/spacenet/SN3_roads/tarballs/ --human-readable
2019-08-23 12:09:19  728.8 MiB SN3_roads_sample.tar.gz
2019-08-23 11:34:23    8.1 GiB SN3_roads_test_public_AOI_2_Vegas.tar.gz
2019-08-23 11:35:30    1.8 GiB SN3_roads_test_public_AOI_3_Paris.tar.gz
2019-08-23 11:30:48    8.0 GiB SN3_roads_test_public_AOI_4_Shanghai.tar.gz
2019-08-23 11:31:55    1.6 GiB SN3_roads_test_public_AOI_5_Khartoum.tar.gz
2019-08-21 10:27:25   24.3 GiB SN3_roads_train_AOI_2_Vegas.tar.gz
2019-09-03 12:00:38    1.4 MiB SN3_roads_train_AOI_2_Vegas_geojson_roads_speed.tar.gz
2019-08-21 10:27:25    5.5 GiB SN3_roads_train_AOI_3_Paris.tar.gz
2019-09-03 12:01:25  234.7 KiB SN3_roads_train_AOI_3_Paris_geojson_roads_speed.tar.gz
2019-08-21 10:27:25   24.0 GiB SN3_roads_train_AOI_4_Shanghai.tar.gz
2019-09-03 12:01:47    1.6 MiB SN3_roads_train_AOI_4_Shanghai_geojson_roads_speed.tar.gz
2019-08-21 10:27:25    4.9 GiB SN3_roads_train_AOI_5_Khartoum.tar.gz
2019-09-03 12:02:16  486.4 KiB SN3_roads_train_AOI_5_Khartoum_geojson_roads_speed.tar.gz
As you can see above, some of these files are multiple GBs, so please check the available space on your local disk before downloading.
An example of downloading a specific file and unzipping
aws s3 cp s3://spacenet-dataset/spacenet/SN5_roads/tarballs/SN5_roads_train_AOI_7_Moscow.tar.gz .
tar -xvzf ~/spacenet5data/moscow/SN5_roads_train_AOI_7_Moscow.tar.gz
An example of downloading a whole folder from S3
!aws s3 cp s3://spacenet-dataset/spacenet/SN3_roads/tarballs/ . --recursive
I have some instructions for launching Jupyter Lab on the Intel Developer Cloud instance, and then connecting to it from a web browser on your local machine. After you have installed Jupyter Lab in the previous steps, you can launch a Jupyter server from the command line. I prefer to do this in a tmux window, so that I can detach and still use the same terminal window. You can read more about tmux on their GitHub. You can use the provided script:
jupyter lab --port=8080 --ServerApp.ip=* --no-browser
In order to be able to connect to the Jupyter server on a local browser, you can create an SSH tunnel with a command like the following in a different terminal window (connecting your local machine to the remote machine):
ssh -J guest@146.152.226.42 -L 8080:localhost:8080 devcloud@192.168.19.2
You should now be able to go to a local browser and enter the URL below to access your Jupyter server
https://localhost:8080
#Command to show how much room you have available on disk
!df -h
Filesystem      Size  Used Avail Use% Mounted on
tmpfs            51G  4.0M   51G   1% /run
/dev/nvme0n1p2  3.5T  415G  2.9T  13% /
tmpfs           252G  160K  252G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/nvme0n1p1  537M  5.3M  532M   1% /boot/efi
tmpfs            51G  4.0K   51G   1% /run/user/1000
#importing packages
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
import skimage.io
import json
import pandas as pd
%load_ext autoreload
%autoreload 2
The images have names like the following: SN5_roads_train_AOI_7_Moscow_PS-RGB_chip180.tif
- SN5 means SpaceNet 5
- PS-MS means a pan-sharpened (PS) 8-band multispectral (MS) image
- PS-RGB means a pan-sharpened (PS) 3-band RGB image

The 8-band multispectral images are of shape (1300,1300,8), whereas the RGB images are of shape (1300,1300,3). In this notebook I only show the 3-channel RGB images, but feel free to explore ways to visualize the 8-band images.
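For convenience, here is a small illustrative parser for these filenames (the function and regex are my own, not part of the cresi codebase):

```python
import re

# Matches names like SN5_roads_train_AOI_7_Moscow_PS-RGB_chip180.tif
PATTERN = re.compile(
    r"SN(?P<sn>\d+)_roads_(?P<split>train|test_public)_"
    r"AOI_(?P<aoi>\d+)_(?P<city>[A-Za-z_]+)_"
    r"(?P<bands>PS-RGB|PS-MS)_chip(?P<chip>\d+)\.tif"
)

def parse_tile_name(name: str) -> dict:
    m = PATTERN.match(name)
    if m is None:
        raise ValueError(f"unrecognized tile name: {name}")
    d = m.groupdict()
    # 8-band multispectral tiles are (1300,1300,8); RGB tiles are (1300,1300,3)
    d["n_channels"] = 8 if d["bands"] == "PS-MS" else 3
    return d

print(parse_tile_name("SN5_roads_train_AOI_7_Moscow_PS-RGB_chip180.tif"))
```

This kind of helper is handy when pairing image chips with their mask files later on.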
Each "chip" of a satellite image has corresponding labels of where the streets are located. We will teach a neural network with these labels, so that it can predict where streets are based on the satellite image alone.
After unzipping the SN5_roads_train_AOI_7_Moscow.tar.gz file, you should find the geojson_roads_speed folder within each area of interest (AOI) directory (nfs/data/cosmiq/spacenet/competitions/SN5_roads/tiles_upload/train/AOI_7_Moscow/geojson_roads_speed). It contains road centerline labels along with estimates of safe travel speeds for each roadway. We'll use these centerline labels and speed estimates to create training masks. We assume a mask buffer of 2 meters, meaning that each roadway is assigned a total width of 4 meters. Remember that the goal of our segmentation step is to detect road centerlines, so while this is not the precise width of the road, a buffer of 2 meters is an appropriate width for our segmentation model.
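To make the buffering concrete, here is a minimal NumPy sketch of the idea, assuming a ground sample distance of roughly 0.3 m per pixel (the config later sets "GSD": 0.3): every pixel within 2 m of a centerline segment is marked as road.

```python
import numpy as np

def centerline_mask(shape, p0, p1, buffer_m=2.0, gsd_m=0.3):
    """Rasterize one centerline segment p0->p1 (pixel coords) with a metric buffer."""
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    pts = np.stack([rows, cols], axis=-1).astype(float)
    a, b = np.asarray(p0, float), np.asarray(p1, float)
    ab = b - a
    # Project each pixel onto the segment, clamped to its endpoints
    t = np.clip(((pts - a) @ ab) / (ab @ ab), 0.0, 1.0)
    closest = a + t[..., None] * ab
    dist_m = np.linalg.norm(pts - closest, axis=-1) * gsd_m
    return (dist_m <= buffer_m).astype(np.uint8)

# A horizontal road across a 100x100 chip: ~4 m total width in the mask
mask = centerline_mask((100, 100), (50, 10), (50, 90))
print(mask.shape, mask[50, 50], mask[80, 50])  # (100, 100) 1 0
```

The real speed_masks.py script works from GeoJSON geometries and also encodes speed into the mask values, but the geometric buffering step is the same idea.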
One option for training a segmentation model is to create training masks where the value of the mask is proportional to the speed of the roadway. This can be accomplished by running the speed_masks.py script (https://github.com/avanetten/cresi/blob/main/cresi/data_prep/speed_masks.py).
python3 /home/devcloud/cresi/cresi/data_prep/speed_masks.py --geojson_dir=/home/devcloud/spacenet5data/moscow/data/geojson_roads_speed \
--image_dir=/home/devcloud/spacenet5data/moscow/data/PS-MS \
--output_conversion_csv_binned=/home/devcloud/spacenet5data/moscow/data/v6/output_conversion_csv_binned/sn5_roads_train_speed_conversion_binned.csv \
--output_mask_dir=/home/devcloud/spacenet5data/moscow/data/v6/train_mask_binned \
--output_mask_multidim_dir=/home/devcloud/spacenet5data/moscow/data/v6/train_mask_binned_mc \
--buffer_distance_meters=2 \
--crs=None
Displaying Satellite Images and Corresponding Image Masks
Below, I am showing a sample of some of the satellite images in RGB format (each image is an array of shape (1300,1300,3), where 1300x1300 is the pixel count and 3 is the number of R-G-B color channels).
The generated mask images are the corresponding masks to these specific clips. The different colors of the roads indicate different speed limits.
#Showing some sample images
image_dir = '/home/devcloud/spacenet5data/moscow/train_data/PS-RGB' #replace with your local path to the images
image_names = ['SN5_roads_train_AOI_7_Moscow_PS-RGB_chip180.tif',
'SN5_roads_train_AOI_7_Moscow_PS-RGB_chip181.tif',
'SN5_roads_train_AOI_7_Moscow_PS-RGB_chip182.tif',
'SN5_roads_train_AOI_7_Moscow_PS-RGB_chip183.tif']
full_path_images = [Path(image_dir,img) for img in image_names]
fig = plt.figure(figsize=(30, 10))
columns = 4
rows = 1
ax = []
for idx,num in enumerate(range(columns*rows)):
img = skimage.io.imread(full_path_images[idx])
ax.append(fig.add_subplot(rows,columns,num+1))
ax[-1].set_title(image_names[idx])
plt.imshow(img)
plt.show()
print(f'Image array shape is {img.shape}')
Image array shape is (1300, 1300, 3)
#Showing some sample images
image_dir = '/home/devcloud/spacenet5data/moscow/v10/train_mask_binned/' #replace with your local path to the images
image_names = ['SN5_roads_train_AOI_7_Moscow_PS-MS_chip180.tif',
'SN5_roads_train_AOI_7_Moscow_PS-MS_chip181.tif',
'SN5_roads_train_AOI_7_Moscow_PS-MS_chip182.tif',
'SN5_roads_train_AOI_7_Moscow_PS-MS_chip183.tif']
full_path_images = [Path(image_dir,img) for img in image_names]
fig = plt.figure(figsize=(30, 10))
columns = 4
rows = 1
ax = []
for idx,num in enumerate(range(columns*rows)):
img = skimage.io.imread(full_path_images[idx])
ax.append(fig.add_subplot(rows,columns,num+1))
ax[-1].set_title(image_names[idx])
plt.imshow(img)
plt.show()
print(f'Image array shape is {img.shape}')
Image array shape is (1300, 1300)
Building Configuration JSON for train/validation split and training parameters
First, we need to build a JSON configuration file that defines how we will split up the image data into training and validation sets, and define our training parameters. The location of the default configuration file should be found in your cloned cresi directory. You can find a sample config here:
https://github.com/avanetten/cresi/blob/main/cresi/configs/sn5_baseline_aws.json.
My edited config file looks like:
{
"path_src": "/home/devcloud/cresi/cresi",
"path_results_root": "/home/devcloud/spacenet5data/moscow/v10_xeon4_devcloud22.04",
"train_data_refined_dir_ims": "/home/devcloud/spacenet5data/moscow/train_data/PS-MS",
"train_data_refined_dir_masks": "/home/devcloud/spacenet5data/moscow/v10/train_mask_binned_mc",
"speed_conversion_file": "/home/devcloud/spacenet5data/moscow/v10/output_conversion_csv_binned/sn5_roads_train_speed_conversion_binned.csv",
"folds_file_name": "folds4.csv",
"save_weights_dir": "sn5_baseline",
"num_folds": 1,
"default_val_perc": 0.2,
"num_channels": 8,
"num_classes": 8,
"network": "resnet34",
"loss": {
"soft_dice": 0.25,
"focal": 0.75
},
"early_stopper_patience": 8,
"nb_epoch": 30,
"test_data_refined_dir": "/home/devcloud/spacenet5data/moscow/test_data/PS-MS",
"test_results_dir": "sn5_baseline",
"folds_save_dir": "folds",
"tile_df_csv": "tile_df.csv",
"test_sliced_dir": "",
"slice_x": 0,
"slice_y": 0,
"stride_x": 0,
"stride_y": 0,
"skeleton_band": 7,
"skeleton_thresh": 0.3,
"min_subgraph_length_pix": 20,
"min_spur_length_m": 10,
"GSD": 0.3,
"rdp_epsilon": 1,
"log_to_console": 1,
"intersection_band": -1,
"use_medial_axis": 0,
"merged_dir": "merged",
"stitched_dir_raw": "stitched/mask_raw",
"stitched_dir_count": "stitched/mask_count",
"stitched_dir_norm": "stitched/mask_norm",
"wkt_submission": "wkt_submission_nospeed.csv",
"skeleton_dir": "skeleton",
"skeleton_pkl_dir": "sknw_gpickle",
"graph_dir": "graphs",
"padding": 22,
"eval_rows": 1344,
"eval_cols": 1344,
"batch_size": 64,
"iter_size": 1,
"lr": 0.0001,
"lr_steps": [
20,
25
],
"lr_gamma": 0.2,
"test_pad": 64,
"epoch_size": 8,
"predict_batch_size": 33,
"target_cols": 512,
"target_rows": 512,
"optimizer": "adam",
"warmup": 0,
"ignore_target_size": false
}
Code to load JSON into Jupyter:
training_config = '/home/devcloud/cresi/cresi/configs/ben/v10_xeon4_baseline_ben.json'
f = open(training_config)
obj = json.load(f)
print(json.dumps(obj, indent=4))
Here are a few elaborations on some of these parameters:
| parameter | description |
|---|---|
| path_src | directory of the local clone of the cresi repository |
| path_results_root | directory where all results are saved, including the split of training and validation data, and trained model weights |
| train_data_refined_dir_ims | directory where all of the training images reside. For example, SN5_roads_train_AOI_7_Moscow_PS-MS_chip1241.tif is one of the images in this directory |
| train_data_refined_dir_masks | directory where the mask images reside. These images are generated with the speed_masks.py script referenced earlier, and should have exactly the same names as the training images |
| speed_conversion_file | path to the binned speed conversion CSV file. It should have 3 columns (burn_val,speed,channel), with the first line looking like 36,1,0. Speeds are in miles per hour; there are 7 possible output channels for speeds, and the 8th channel is an aggregate of the rest |
| folds_file_name | this file will be saved in the path_results_root/weights/save_weights_dir directory. It is simply a list of the image file names along with a fold number, e.g. SN5_roads_train_AOI_7_Moscow_PS-MS_chip379.tif,0. In our case, fold number 0 means the image is part of the validation set |
| save_weights_dir | directory where model weights are saved, under the path_results_root/weights path |
| num_folds | number of validation fold datasets. "Using multiple cross-validated models can improve performance, though at the expense of increased run time at inference. For our baseline we use only a single fold, and randomly withhold 20% of the data for validation purposes during training. We use this validation data to test performance after each epoch, and truncate training if the validation loss does not decrease for 8 epochs." (Source) |
| default_val_perc | percentage of data used for validation during training |
| num_channels | the number of channels in the training image. For PS-MS pan-sharpened (PS) 8-band multispectral (MS) images, use 8. For PS-RGB pan-sharpened (PS) 3-band RGB images, use 3 |
| num_classes | number of speed limit bins. In our case, we have binned the data into 8 speed limit buckets |
| network | neural network architecture. Choices at the time of writing are: resnet34, resnet50, resnet101, seresnet50, seresnet101, seresnet152, seresnext50, or seresnext101. The choice of model architecture will influence the speed and accuracy of training |
| loss | type of loss function. "Custom loss function comprised of 25% Dice Loss and 75% Focal Loss." (Source) |
| nb_epoch | number of epochs |
| batch_size | number of images to load into memory in one batch. This can be increased or decreased depending on your memory constraints |
| optimizer | optimizers that can be chosen at the time of writing: adam, rmsprop, and sgd |
There are numerous other configuration parameters that can be adjusted for training time and inference time, but we will leave those alone.
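As a hedged illustration of the speed binning behind num_classes and speed_conversion_file: the real bin edges come from the conversion CSV, so the edges below are made up, but the mechanics with np.digitize look like this (7 speed channels, with an 8th aggregate channel on top):

```python
import numpy as np

# Hypothetical mph bin edges producing 7 bins; channel 7 (not shown) would be
# the aggregate of all roads, giving the 8 classes in "num_classes": 8.
bin_edges = [15, 25, 35, 45, 55, 65]
speeds = np.array([10, 20, 30, 40, 50, 60, 70])  # mph
channels = np.digitize(speeds, bin_edges)        # one channel index per bin
print(channels.tolist())  # [0, 1, 2, 3, 4, 5, 6]
```

Each road segment's speed estimate thus selects which mask channel it is burned into.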
Creating Validation Fold
We need to create the validation fold of data. To do that, we can use the 00_gen_folds.py script (https://github.com/avanetten/cresi/blob/main/cresi/00_gen_folds.py) with the previously configured parameter file. For example, I ran:
!python3 /home/devcloud/cresi/cresi/00_gen_folds.py /home/devcloud/cresi/cresi/configs/ben/v5_sn5_baseline_ben.json
The output is a folds CSV file: /home/devcloud/spacenet5data/moscow/results_xeon4_devcloud/weights/sn5_baseline/folds4.csv
There are folds 0, 1, 2, 3, and 4. The validation data are fold 0 (20%), and the training data are folds 1-4 (80%).
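The fold split can be sketched as below; the uniform random assignment and seed are my own assumptions for illustration, not the exact logic of 00_gen_folds.py:

```python
import random

def assign_folds(names, num_folds=5, seed=42):
    """Assign each tile a fold 0..num_folds-1; fold 0 (~20%) is the validation set."""
    rng = random.Random(seed)
    return {name: rng.randrange(num_folds) for name in names}

names = [f"SN5_roads_train_AOI_7_Moscow_PS-MS_chip{i}.tif" for i in range(1353)]
folds = assign_folds(names)
val = [n for n, f in folds.items() if f == 0]
print(len(names), len(val))  # roughly 20% of the 1353 tiles land in fold 0
```

Writing this mapping out as name,fold pairs gives a CSV in the same shape as folds4.csv shown below.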
folds = '/home/devcloud/spacenet5data/moscow/v10_xeon4_devcloud22.04/weights/sn5_baseline/folds4.csv'
pd.read_csv(folds)
|   | Unnamed: 0 | fold |
|---|---|---|
| 0 | SN5_roads_train_AOI_7_Moscow_PS-MS_chip842.tif | 0 |
| 1 | SN5_roads_train_AOI_7_Moscow_PS-MS_chip1327.tif | 1 |
| 2 | SN5_roads_train_AOI_7_Moscow_PS-MS_chip561.tif | 2 |
| 3 | SN5_roads_train_AOI_7_Moscow_PS-MS_chip91.tif | 3 |
| 4 | SN5_roads_train_AOI_7_Moscow_PS-MS_chip1263.tif | 4 |
| ... | ... | ... |
| 1348 | SN5_roads_train_AOI_7_Moscow_PS-MS_chip843.tif | 3 |
| 1349 | SN5_roads_train_AOI_7_Moscow_PS-MS_chip932.tif | 4 |
| 1350 | SN5_roads_train_AOI_7_Moscow_PS-MS_chip1032.tif | 0 |
| 1351 | SN5_roads_train_AOI_7_Moscow_PS-MS_chip1168.tif | 1 |
| 1352 | SN5_roads_train_AOI_7_Moscow_PS-MS_chip761.tif | 2 |
1353 rows × 2 columns
I had to make some minor code changes to the cresi repo, described below, in order to run with the new Intel CPU with AMX.
- Replace self.model = nn.DataParallel(model).cuda() with self.model = nn.DataParallel(model)
- Remove GPU-only calls such as torch.randn(10).cuda()
- Optimize the training code with Intel Extension for PyTorch, to get the most benefit out of training on a CPU
In https://github.com/avanetten/cresi/blob/main/cresi/net/pytorch_utils/train.py:
Add import intel_extension_for_pytorch as ipex with the import statements
Add line for IPEX optimization and bfloat16 for mixed precision training just after defining the model and optimizer:
self.model = nn.DataParallel(model) #.cuda() only for GPU
self.optimizer = optimizer(self.model.parameters(), lr=config.lr)
self.model, self.optimizer = ipex.optimize(self.model, optimizer=self.optimizer,dtype=torch.bfloat16)
Add a line to do mixed precision on CPU just before running a forward pass and calculating the loss function:
with torch.cpu.amp.autocast():
    if verbose:
        print("input.shape, target.shape:", input.shape, target.shape)
    output = self.model(input)
    meter = self.calculate_loss_single_channel(output, target, meter, training, iter_size)
In the evaluation code:
- Add import intel_extension_for_pytorch as ipex with the import statements
- Add IPEX optimization with bfloat16 just after loading the model:

model = torch.load(os.path.join(path_model_weights, 'fold{}_best.pth'.format(fold)), map_location=lambda storage, loc: storage)
model.eval()
model = ipex.optimize(model, dtype=torch.bfloat16)

- Add mixed precision just before the prediction loop:

with torch.no_grad():
    with torch.cpu.amp.autocast():
        for data in pbar:
            samples = torch.autograd.Variable(data['image'], volatile=True)
            predicted = predict(model, samples, flips=self.flips)

Elsewhere in the codebase, a few deprecated or GPU-only calls needed updating:
- Replace .view functions with .reshape
- Replace F.sigmoid (or torch.nn.functional.sigmoid) with torch.sigmoid
- Replace .clip_grad_norm with .clip_grad_norm_
- Replace self.estimator.lr_scheduler.step(epoch) with self.estimator.lr_scheduler.step()
- Remove the torch.autograd.Variable wrappers and .cuda() calls, as autograd Variables are no longer needed:
  input = torch.autograd.Variable(input.cuda(async=True), volatile=not training)
  target = torch.autograd.Variable(target.cuda(async=True), volatile=not training)

Let's recap. We have set up our environment, downloaded the SpaceNet 5 data, generated training masks and a validation fold, and modified the cresi code to train with BF16 mixed precision on the CPU.
I am training the whole ResNet34 + UNet backbone, and keeping no layers frozen during training. We now can begin our training run with the 01_train.py script (original here: https://github.com/avanetten/cresi/blob/main/cresi/01_train.py).
ipexrun --ninstances 1 --ncore_per_instance 32 /home/devcloud/cresi/cresi/01_train.py /home/devcloud/cresi/cresi/configs/ben/v10_xeon4_baseline_ben.json --fold=0
The ipexrun tool is a simplified way of using the numactl command-line tool to specify which hardware configuration to use. In my case, I am asking my training run to:
- use 1 instance (--ninstances 1)
- use 32 cores on that instance (--ncore_per_instance 32)

I found that the training performed better while staying on 1 socket, because of the overhead of communicating across sockets. I also found that going up to the maximum of 56 physical cores improved the epoch time slightly, to 14.5 minutes from 15.75 minutes, but overall, keeping the training on 1 socket rather than across 2 sockets is what mattered most.
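For illustration, here is roughly what that pinning amounts to on this machine's topology (a sketch, assuming the lscpu layout shown earlier: node0's physical cores are CPUs 0-55 and their hyperthread siblings are 112-167; the printed numactl command is illustrative, not a literal replacement for ipexrun):

```python
def physical_cores(node_cpus, cores_per_socket, n):
    """Pick the first n physical core IDs from a NUMA node's CPU list."""
    phys = node_cpus[:cores_per_socket]  # first half = physical cores, rest = HT siblings
    return phys[:n]

# From "NUMA node0 CPU(s): 0-55,112-167" in the lscpu output
node0 = list(range(0, 56)) + list(range(112, 168))
cores = physical_cores(node0, 56, 32)
print(f"numactl -C {cores[0]}-{cores[-1]} -m 0 python 01_train.py ...")
```

Pinning to one socket's physical cores keeps both compute and memory traffic local to a single NUMA node, which is why crossing sockets hurt epoch time.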
During Training Run
During training, you can launch TensorBoard with the appropriate log directory to monitor the loss function progress over time:
tensorboard --logdir /home/devcloud/spacenet5data/moscow/v10_xeon4_devcloud22.04/logs/sn5_baseline/fold0 --port 8090
And then you can create another SSH tunnel:
ssh -J guest@146.152.226.42 -L 8090:localhost:8090 devcloud@192.168.19.2
and launch in your local browser at https://localhost:8090. It should look like:
Training Run Results
To run inference, we can use the 02_eval.py script (https://github.com/avanetten/cresi/blob/main/cresi/02_eval.py). Remember that we modified a few lines to accommodate AMX and BF16 (see Codebase Changes above).
python3 /home/devcloud/cresi/cresi/02_eval.py /home/devcloud/cresi/cresi/configs/ben/v10_xeon4_baseline_ben.json
Here is a sample mask output that was predicted.
img_path = '/home/devcloud/spacenet5data/moscow/v10_xeon4_devcloud22.04/sn5_baseline/folds/fold0_SN5_roads_test_public_AOI_7_Moscow_PS-MS_chip95.tif'
# inspect
mask_pred = skimage.io.imread(img_path)
print("mask_pred.shape:", mask_pred.shape)
# plot all layers
fig, axes = plt.subplots(2, 4, figsize=(16, 9))
for i, ax in enumerate(axes.flatten()):
if i < (len(axes.flatten()) - 1):
title = 'Mask Channel {}'.format(str(i))
else:
title = 'Aggregate'
ax.imshow(mask_pred[i,:,:])
ax.set_title(title)
mask_pred.shape: (8, 1300, 1300)
Here is the original image and the mask aggregate prediction.
full_path_image = '/home/devcloud/spacenet5data/moscow/test_data/PS-RGB/SN5_roads_test_public_AOI_7_Moscow_PS-RGB_chip95.tif'
full_path_mask = '/home/devcloud/spacenet5data/moscow/v10_xeon4_devcloud22.04/sn5_baseline/folds/fold0_SN5_roads_test_public_AOI_7_Moscow_PS-MS_chip95.tif'
img1 = skimage.io.imread(full_path_image)
img2 = skimage.io.imread(full_path_mask)[7,:,:]
f, axarr = plt.subplots(1,2,figsize=(16, 9))
axarr[0].imshow(img1)
axarr[0].set_title('Satellite Image chip 95')
axarr[1].imshow(img2)
axarr[1].set_title('Prediction Mask chip 95')
Text(0.5, 1.0, 'Prediction Mask chip 95')
I realize that the model I've trained is overfit to the Moscow image data and likely will not generalize well to other cities. However, the winning solution to this challenge used data from 6 cities (Las Vegas, Paris, Shanghai, Khartoum, Moscow, Mumbai) and performed well on a new city.
In the future, one thing that would be worth testing is training on all 6 cities and running inference on another city to reproduce their results.
Merge, stitch, skeletonize
There are further post-processing steps that can be performed to add the mask as graph features to maps. You can read more about the post-processing steps here:
And the post-processing scripts can be found here:
The text outputs are quite verbose, and so I recommend doing these in a separate terminal (and with tmux), rather than in a notebook.
python3 /home/devcloud/cresi/cresi/03a_merge_preds.py /home/devcloud/cresi/cresi/configs/ben/v10_xeon4_baseline_ben.json
python3 /home/devcloud/cresi/cresi/03b_stitch.py /home/devcloud/cresi/cresi/configs/ben/v10_xeon4_baseline_ben.json
python3 /home/devcloud/cresi/cresi/04_skeletonize.py /home/devcloud/cresi/cresi/configs/ben/v10_xeon4_baseline_ben.json
python3 /home/devcloud/cresi/cresi/05_wkt_to_G.py /home/devcloud/cresi/cresi/configs/ben/v10_xeon4_baseline_ben.json
Summary of accomplishments:
This fine-tuning of an image segmentation model (for a ResNet34 + UNet model) led to the following conclusions for me:
You can find much more detailed benchmarks here for Sapphire Rapids, including statements concerning the NVIDIA A100 GPU: https://edc.intel.com/content/www/us/en/products/performance/benchmarks/4th-generation-intel-xeon-scalable-processors/
As a data scientist, I care more about the fact that I can do a training run overnight (or during a workday) than about specific performance benchmarks. The new Sapphire Rapids CPU brings home the new reality of fine-tuning on a CPU, which is exciting. Happy coding!
Start by using the Intel Extension for PyTorch:
pip install intel-extension-for-pytorch
git clone https://github.com/intel/intel-extension-for-pytorch
Please let me know if you used the code by connecting with me on one of the following places:
Important Legal Notices Performance varies by use, configuration and other factors.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See configuration disclosure for details. No product or component can be absolutely secure.
Intel optimizations, for Intel compilers or other products, may not optimize to the same degree for non-Intel products.
Your costs may vary.
Intel technologies may require enabled hardware, software or service activation.
Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.
Altering clock frequency or voltage may void any product warranties and reduce stability, security, performance, and life of the processor and other components. Check with system and component manufacturers for details.
Intel contributes to the development of benchmarks by participating in, sponsoring, and/or contributing technical support to various benchmarking groups, including the BenchmarkXPRT Development Community administered by Principled Technologies.
Read our Benchmarks and Measurements disclosures and our Battery Life disclosures.
Statements on Intel's websites that refer to future plans or expectations are forward-looking statements. These statements are based on current expectations and involve many risks and uncertainties that could cause actual results to differ materially from those expressed or implied in such statements. For more information on the factors that could cause actual results to differ materially, see our most recent earnings release and SEC filings at https://www.intc.com/.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.